1. Overview and scope of application
Applies to: block storage, object storage, and file service nodes deployed in the Singapore region (for example, AWS ap-southeast-1 or Alibaba Cloud Singapore).
Goal: ensure availability, predictable capacity, and alerting that is actionable and automated. This article uses Prometheus, Grafana, and Alertmanager as the example monitoring stack and includes practical expansion and temporary mitigation steps.
2. Monitoring item collection and deployment steps (instance level)
Steps:
1) Install node_exporter on each storage server (Debian/Ubuntu): sudo apt update && sudo apt install -y prometheus-node-exporter
2) Configure the Prometheus scrape job: add the following under scrape_configs in prometheus.yml, replacing ip with the server address, then restart Prometheus with sudo systemctl restart prometheus
    - job_name: 'nodes'
      static_configs:
        - targets: ['ip:9100']
3) Collect: disk usage (/ and /data), inode usage, disk latency (from iostat, or derived from the node_exporter counters node_disk_read_time_seconds_total and node_disk_reads_completed_total), network bandwidth, CPU, memory, disk queue length, and the number of open file handles.
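As a rough local cross-check of the disk and inode usage figures in step 3, the following minimal sketch computes the same percentages with Python's standard os.statvfs. The function names are illustrative and not part of node_exporter, and the calculation (total minus space available to unprivileged users) is assumed to correspond to the avail/size ratio commonly used for alerting.

```python
import os


def usage_percent(total: int, available: int) -> float:
    """Return the used share in percent; 0.0 when the total is zero."""
    if total <= 0:
        return 0.0
    return (total - available) * 100.0 / total


def filesystem_usage(mountpoint: str) -> dict:
    """Disk and inode usage for one mount point, read via statvfs (Unix only)."""
    st = os.statvfs(mountpoint)
    return {
        # Block counts share the same unit, so the ratio needs no byte conversion.
        "disk_used_percent": usage_percent(st.f_blocks, st.f_bavail),
        # Some file systems report zero inodes; usage_percent then returns 0.0.
        "inode_used_percent": usage_percent(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    # "/" and "/data" are the mount points named in the article.
    for mp in ("/", "/data"):
        if os.path.ismount(mp):
            print(mp, filesystem_usage(mp))
```

Values from this script should be close to what Prometheus shows for the same mount point, which makes it a quick sanity check after a new exporter is deployed.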
3. Object storage and gateway monitoring
Steps:
1) For S3-compatible storage, enable access logging on the storage side, deliver the logs to a dedicated bucket, parse them with Fluentd or Fluent Bit, and either expose the results as metrics for Prometheus or ship them directly to Elasticsearch.
2) Key indicators: PUT/GET 4xx and 5xx rates, p95/p99 response latency, sharding/replication lag, object count growth rate, and the number of lifecycle hot/cold transitions.
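The error-rate and latency indicators in step 2 can be derived from parsed access-log records roughly as follows. This is a sketch under assumptions: the Request record and its field names are hypothetical (real log schemas differ by vendor), and the percentile uses the simple nearest-rank method, which may differ slightly from the interpolation a given backend applies.

```python
import math
from typing import Iterable, NamedTuple, Sequence, Tuple


class Request(NamedTuple):
    """One parsed access-log entry (hypothetical schema)."""
    method: str        # e.g. "PUT" or "GET"
    status: int        # HTTP status code
    latency_ms: float  # total request time in milliseconds


def error_rates(requests: Iterable[Request]) -> Tuple[float, float]:
    """Return (4xx_rate, 5xx_rate) as fractions of all requests."""
    reqs = list(requests)
    if not reqs:
        return (0.0, 0.0)
    n4 = sum(1 for r in reqs if 400 <= r.status < 500)
    n5 = sum(1 for r in reqs if 500 <= r.status < 600)
    return (n4 / len(reqs), n5 / len(reqs))


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Filtering the records by method before calling these functions gives the separate PUT and GET figures.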
4. Alarm rules and threshold recommendations (example)
Example Prometheus rules:
1) disk_usage_percent > 80 for 5m → warning; > 90 for 2m → critical.
2) inode_usage > 90% for 5m.
3) disk_io_avg_latency_ms > 50 ms for 5m.
4) s3_5xx_rate > 0.5% for 10m.
The metric names above are illustrative. A rule written against node_exporter metrics looks like this (less than 20% available on /data is the same as more than 80% used):
    - alert: DiskAlmostFull
      expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) * 100 < 20
      for: 5m
      labels:
        severity: warning
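To make the effect of the "for" durations concrete, the sketch below applies the example disk thresholds to a series of usage samples. It assumes one sample per minute and treats "for Nm" as "the last N samples all exceed the threshold"; Prometheus itself evaluates at its configured rule interval, so this is only an approximation of its behavior. The function names are illustrative.

```python
from typing import Optional, Sequence


def held_for(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0 or len(samples) < window:
        return False
    return all(v > threshold for v in samples[-window:])


def disk_severity(usage_percent_per_minute: Sequence[float]) -> Optional[str]:
    """Apply the example thresholds: > 90 for 2m is critical, > 80 for 5m is warning."""
    if held_for(usage_percent_per_minute, 90.0, 2):
        return "critical"
    if held_for(usage_percent_per_minute, 80.0, 5):
        return "warning"
    return None
```

A single spike above 90% does not page anyone under these rules; the condition has to persist, which is what keeps short bursts from generating noise.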
5. Alarm routing and receiver configuration
Steps:
1) Configure routes in Alertmanager: route to Slack, email, PagerDuty, or SMS by severity, team, and service classification.
2) Configure notification templates and suppression (inhibition rules or silences): short-lived I/O peaks can be suppressed for 15 minutes.
3) Test the process: use amtool or curl to fire a simulated alert and confirm that both the primary and CC recipients receive it.
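The simulated alert in step 3 can also be sent from a short script. The sketch below assumes Alertmanager's v2 HTTP API (POST to /api/v2/alerts with a JSON list of alerts) and the default address http://localhost:9093; the alert name RoutingTest and the label values are hypothetical and should be replaced with labels that your routes actually match on.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(severity: str, team: str, minutes: int = 5) -> list:
    """Build a one-alert payload that expires after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "RoutingTest",  # hypothetical name for the simulation
            "severity": severity,
            "team": team,
        },
        "annotations": {"summary": "Simulated alert for routing verification"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(base_url: str, alerts: list) -> int:
    """POST the payload to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# Example (requires a reachable Alertmanager):
# send_alerts("http://localhost:9093", build_test_alert("warning", "storage"))
```

Sending one alert per severity level is a simple way to confirm that each route reaches the intended receiver.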
6. Alarm handling (runbook) and quick handling commands
General process: receive an alarm → log in to the affected host → check top, df -h, iostat, and vmstat → determine whether the cause is a sudden spike or long-term growth.
Quickly free up space:
1) Clean /var/log by trimming the systemd journal to the last 3 days: sudo journalctl --vacuum-time=3d
2) Clean temporary directories, after confirming that no running process still needs the files: sudo rm -rf /tmp/*
3) Delete old backups or migrate them to cold storage (example: aws s3 mv /backup s3://cold-bucket --recursive --storage-class GLACIER).
Temporary solution for capacity expansion: mount a new disk, rsync the data to the new disk, and update fstab.
7. Capacity planning steps (detailed how-to guide)
1) Data collection: export daily used_bytes, object_count, and daily_ingest_bytes for the past 90-180 days. You can export CSV from Prometheus or from a cloud monitoring API such as AWS CloudWatch.
2) Calculate the daily growth rate: use linear regression, or take the average daily increment over the last 30 days = (last - first) / days.
3) Forecast and safety factor: size for the 95th percentile of the forecast to cover business peaks, then add 20%-30% headroom (up to 50% for critical workloads).
4) Develop a retention and tiering policy: keep data in hot storage for 30 days and in cold storage for 90-365 days, and enable lifecycle rules to transition it automatically. Document the policy and register it in the CMDB.
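Steps 2 and 3 can be sketched in a few lines of standard-library Python. The example fits a least-squares line to daily used_bytes samples, shows the simpler (last - first) / days alternative, and adds headroom to the projection. The function names and the 30% default headroom are illustrative choices within the 20%-30% range above; the 95th-percentile adjustment for business peaks is not modeled here.

```python
from typing import Sequence, Tuple


def fit_line(y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit y = intercept + slope * day, for day = 0..n-1."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in enumerate(y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def average_daily_increment(y: Sequence[float]) -> float:
    """(last - first) / days, where days is the number of elapsed days."""
    if len(y) < 2:
        raise ValueError("need at least two daily samples")
    return (y[-1] - y[0]) / (len(y) - 1)


def required_capacity(y: Sequence[float], horizon_days: int,
                      headroom: float = 0.3) -> float:
    """Projected usage `horizon_days` after the last sample, plus headroom."""
    slope, intercept = fit_line(y)
    projected = intercept + slope * (len(y) - 1 + horizon_days)
    return projected * (1.0 + headroom)
```

Regression is less sensitive to a single unusual day than the first-and-last difference, so comparing the two results is a quick check for outliers in the exported data.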
8. Capacity expansion operation (block storage/cloud disk and file system)
Cloud disk expansion (taking AWS as an example):
1) Resize the volume: aws ec2 modify-volume --volume-id vol-xxx --size 200 --region ap-southeast-1
2) Check on the instance with sudo lsblk. If the partition needs to grow: sudo growpart /dev/xvdf 1. Then grow the file system: for XFS, sudo xfs_growfs /mountpoint; for ext4, sudo resize2fs /dev/xvdf1.
Add a new disk and migrate: mount the new disk → rsync -av /data/ /mnt/newdata/ → modify fstab → restart the service and cut over gradually.
9. Q&A 1
Question: How can false 5xx alarms for object storage in the Singapore region be prevented?
Answer: The key is to combine percentage thresholds with short-term suppression. Use the 5xx request rate (5xx_count / total_requests) as the indicator and alert only when it stays above a threshold such as 0.5% for 10 minutes. Suppress alerts caused by short deployments (silence when deploy_tag=true), and correlate with request latency and back-end error rate to decide whether the fault is real.
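The decision logic in this answer can be sketched as follows. It assumes per-minute (5xx_count, total_requests) pairs and a boolean that mirrors deploy_tag=true; the function names are illustrative, and the latency and back-end correlation step is left out because the article gives no thresholds for it.

```python
from typing import Sequence, Tuple


def five_xx_rate(errors: int, total: int) -> float:
    """5xx_count / total_requests, or 0.0 when there was no traffic."""
    return errors / total if total > 0 else 0.0


def should_alert(per_minute: Sequence[Tuple[int, int]], deploying: bool,
                 threshold: float = 0.005, window: int = 10) -> bool:
    """Alert only if the rate exceeded `threshold` in each of the last `window` minutes.

    `per_minute` holds (5xx_count, total_requests) pairs, oldest first.
    """
    if deploying:
        # Mirrors the suppression rule: stay silent while deploy_tag=true.
        return False
    if len(per_minute) < window:
        return False
    return all(five_xx_rate(e, t) > threshold for e, t in per_minute[-window:])
```

Using a ratio rather than a raw count keeps the rule meaningful at both low and high traffic, and the window prevents a single bad minute from firing the alert.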
10. Q&A 2
Question: Which historical window gives more accurate capacity forecasts?
Answer: A window of 90 to 180 days is usually used, because it captures both seasonality and recent trends. For fast-growing businesses, calculate the 30-day and 90-day growth rates in parallel, take the more conservative value, and keep 20%-30% redundancy. Adjust the forecast temporarily when promotions or migration windows are planned.
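A minimal sketch of the parallel 30-day and 90-day calculation follows. It interprets "conservative" as the larger of the two growth rates, since that plans for more capacity; this interpretation, the function names, and the 25% default redundancy are assumptions made for illustration.

```python
from typing import Sequence


def avg_daily_growth(used_bytes: Sequence[float], days: int) -> float:
    """Average daily increment over the last `days` days: (last - first) / days."""
    if days <= 0 or len(used_bytes) < days + 1:
        raise ValueError("not enough daily samples for this window")
    return (used_bytes[-1] - used_bytes[-(days + 1)]) / days


def conservative_growth(used_bytes: Sequence[float],
                        redundancy: float = 0.25) -> float:
    """Larger of the 30-day and 90-day growth rates, plus redundancy."""
    g30 = avg_daily_growth(used_bytes, 30)
    g90 = avg_daily_growth(used_bytes, 90)
    return max(g30, g90) * (1.0 + redundancy)
```

When growth has recently accelerated, the 30-day rate dominates; when the last month was unusually quiet, the 90-day rate keeps the plan from undershooting.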
11. Q&A 3
Question: What should the first step be when a disk suddenly triggers a high I/O alarm?
Answer: First check traffic and processes. Log in to the host and run iostat -x 1 5, iotop, and ps aux --sort=-%cpu to determine whether a backup, scan, or batch job is responsible. If it is an expected task, throttle it or move the task elsewhere. If it is an abnormal write, find the process generating the large files and temporarily stop that service. If necessary, move the hot data to a cold (lightly loaded) disk.
